ä¹³č…ŗē™ŒčÆŠę–­é¢„ęµ‹åˆ†ęžĀ¶

1. 锹目概述¶

2. ę•°ę®å¤„ē†Ā¶

2.1 ę•°ę®åŠ č½½äøŽé¢„č§ˆĀ¶

  • 加载CSVę•°ę®é›†
  • äŗ¤äŗ’å¼č”Øę ¼å±•ē¤ŗę•°ę®ę ·ęœ¬
  • åŸŗęœ¬ē»Ÿč®”äæ”ęÆåˆ†ęž

2.2 ę•°ę®é¢„å¤„ē†Ā¶

  • åˆ é™¤ę— å…³åˆ—(ID)å’Œē¼ŗå¤±å€¼
  • ę ‡ē­¾ę•°å€¼åŒ–(M→1, B→0)
  • åˆ†å±‚ęŠ½ę ·åˆ’åˆ†č®­ē»ƒé›†/测试集
  • SMOTEčæ‡é‡‡ę ·å¤„ē†ē±»åˆ«äøå¹³č””
  • ē‰¹å¾ę ‡å‡†åŒ–

2.3 ę•°ę®åÆč§†åŒ–Ā¶

  • čÆŠę–­ē»“ęžœåˆ†åøƒé„¼å›¾
  • ē‰¹å¾åˆ†åøƒē®±ēŗæå›¾ļ¼ˆåˆ†ē»„å±•ē¤ŗļ¼‰
  • å…³é”®ē‰¹å¾č”åˆåˆ†åøƒēŸ©é˜µ
  • ē‰¹å¾ē›øå…³ę€§ēƒ­åŠ›å›¾
  • RFE 特征选择

2.4 特征巄程¶

  • å¼‚åøøå€¼ę£€ęµ‹ļ¼ˆZ-score + IQR + Isolation Forest)
  • ä½Žę–¹å·®ē‰¹å¾čæ‡ę»¤
  • é«˜ē›øå…³ę€§ē‰¹å¾ē§»é™¤
  • RFEē‰¹å¾é€‰ę‹©äøŽåÆč§†åŒ–

2.5 ęØ”åž‹ęž„å»ŗäøŽčÆ„ä¼°Ā¶

  • åŸŗēŗæęØ”åž‹ļ¼šé€»č¾‘å›žå½’+SMOTE
  • ä¼˜åŒ–ęØ”åž‹ļ¼šXGBoost + RandomizedSearchCV + EarlyStopping + Threshold Optimization
  • ę€§čƒ½åÆ¹ęÆ”ļ¼šå‡†ē”®ēŽ‡ć€å¬å›žēŽ‡ć€F1值等
  • ROCę›²ēŗæäøŽPRę›²ēŗæåˆ†ęž

3. å…³é”®ē»“ęžœĀ¶

  • ęœ€ä½³ęØ”åž‹ę€§čƒ½åÆ¹ęÆ”
  • ē‰¹å¾é‡č¦ę€§åˆ†ęž
  • é™ē»“ę•ˆęžœčÆ„ä¼°

ä¹³č…ŗē™ŒčÆŠę–­é¢„ęµ‹åˆ†ęžĀ¶

1. 锹目概述¶

  • ē›®ę ‡ļ¼šęž„å»ŗäŗŒå…ƒåˆ†ē±»ęØ”åž‹é¢„ęµ‹ä¹³č…ŗē™ŒčÆŠę–­ē»“ęžœļ¼ˆę¶ę€§/良性)
  • ę•°ę®é›†ļ¼šåØę–Æåŗ·ę˜Ÿä¹³č…ŗē™ŒčÆŠę–­ę•°ę®é›†ļ¼ˆ30个特征)
  • ęŠ€ęœÆę ˆļ¼šPandas, Scikit-learn, XGBoost, Plotly, Matplotlib
InĀ [96]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=False)  
plt.rcParams['font.sans-serif'] = ['SimHei']  
plt.rcParams['axes.unicode_minus'] = False
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

2. ę•°ę®å¤„ē†Ā¶

2.1 ę•°ę®åŠ č½½äøŽé¢„č§ˆĀ¶

InĀ [97]:
df = pd.read_csv('data.csv')

pd.set_option('display.max_columns', None)  
pd.set_option('display.width', None)        
pd.set_option('display.max_colwidth', None) 

from IPython.display import HTML

styled_html = df.head(10).style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#404040'), ('color', 'white')]},
        {'selector': 'td', 'props': [('border', '1px solid #dee2e6')]}
    ]) \
    .format({
        'radius_mean': '{:.2f}', 'texture_mean': '{:.2f}', 'area_mean': '{:.2f}',
        'smoothness_mean': '{:.4f}', 'compactness_mean': '{:.4f}', 'concavity_mean': '{:.4f}',
        'concave points_mean': '{:.4f}', 'symmetry_mean': '{:.4f}', 'fractal_dimension_mean': '{:.4f}',
    }) \
    .highlight_max(color='lightgreen') \
    .highlight_min(color='salmon') \
    .to_html()

html_with_scroll = f"""
<div style='overflow-x: auto; max-width: 100%;'>
    {styled_html}
</div>
"""
display(HTML(html_with_scroll))

print("\nę•°ę®åŸŗęœ¬ē»Ÿč®”äæ”ęÆļ¼š")
styled_stats = df.describe().style.format('{:.2f}')
stats_html = styled_stats.to_html()
stats_with_scroll = f"""
<div style='overflow-x: auto; max-width: 100%;'>
    {stats_html}
</div>
"""
display(HTML(stats_with_scroll))
Ā  id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.800000 1001.00 0.1184 0.2776 0.3001 0.1471 0.2419 0.0787 1.095000 0.905300 8.589000 153.400000 0.006399 0.049040 0.053730 0.015870 0.030030 0.006193 25.380000 17.330000 184.600000 2019.000000 0.162200 0.665600 0.711900 0.265400 0.460100 0.118900 nan
1 842517 M 20.57 17.77 132.900000 1326.00 0.0847 0.0786 0.0869 0.0702 0.1812 0.0567 0.543500 0.733900 3.398000 74.080000 0.005225 0.013080 0.018600 0.013400 0.013890 0.003532 24.990000 23.410000 158.800000 1956.000000 0.123800 0.186600 0.241600 0.186000 0.275000 0.089020 nan
2 84300903 M 19.69 21.25 130.000000 1203.00 0.1096 0.1599 0.1974 0.1279 0.2069 0.0600 0.745600 0.786900 4.585000 94.030000 0.006150 0.040060 0.038320 0.020580 0.022500 0.004571 23.570000 25.530000 152.500000 1709.000000 0.144400 0.424500 0.450400 0.243000 0.361300 0.087580 nan
3 84348301 M 11.42 20.38 77.580000 386.10 0.1425 0.2839 0.2414 0.1052 0.2597 0.0974 0.495600 1.156000 3.445000 27.230000 0.009110 0.074580 0.056610 0.018670 0.059630 0.009208 14.910000 26.500000 98.870000 567.700000 0.209800 0.866300 0.686900 0.257500 0.663800 0.173000 nan
4 84358402 M 20.29 14.34 135.100000 1297.00 0.1003 0.1328 0.1980 0.1043 0.1809 0.0588 0.757200 0.781300 5.438000 94.440000 0.011490 0.024610 0.056880 0.018850 0.017560 0.005115 22.540000 16.670000 152.200000 1575.000000 0.137400 0.205000 0.400000 0.162500 0.236400 0.076780 nan
5 843786 M 12.45 15.70 82.570000 477.10 0.1278 0.1700 0.1578 0.0809 0.2087 0.0761 0.334500 0.890200 2.217000 27.190000 0.007510 0.033450 0.036720 0.011370 0.021650 0.005082 15.470000 23.750000 103.400000 741.600000 0.179100 0.524900 0.535500 0.174100 0.398500 0.124400 nan
6 844359 M 18.25 19.98 119.600000 1040.00 0.0946 0.1090 0.1127 0.0740 0.1794 0.0574 0.446700 0.773200 3.180000 53.910000 0.004314 0.013820 0.022540 0.010390 0.013690 0.002179 22.880000 27.660000 153.200000 1606.000000 0.144200 0.257600 0.378400 0.193200 0.306300 0.083680 nan
7 84458202 M 13.71 20.83 90.200000 577.90 0.1189 0.1645 0.0937 0.0599 0.2196 0.0745 0.583500 1.377000 3.856000 50.960000 0.008805 0.030290 0.024880 0.014480 0.014860 0.005412 17.060000 28.140000 110.600000 897.000000 0.165400 0.368200 0.267800 0.155600 0.319600 0.115100 nan
8 844981 M 13.00 21.82 87.500000 519.80 0.1273 0.1932 0.1859 0.0935 0.2350 0.0739 0.306300 1.002000 2.406000 24.320000 0.005731 0.035020 0.035530 0.012260 0.021430 0.003749 15.490000 30.730000 106.200000 739.300000 0.170300 0.540100 0.539000 0.206000 0.437800 0.107200 nan
9 84501001 M 12.46 24.04 83.970000 475.90 0.1186 0.2396 0.2273 0.0854 0.2030 0.0824 0.297600 1.599000 2.039000 23.940000 0.007149 0.072170 0.077430 0.014320 0.017890 0.010080 15.090000 40.680000 97.650000 711.400000 0.185300 1.058000 1.105000 0.221000 0.436600 0.207500 nan
ę•°ę®åŸŗęœ¬ē»Ÿč®”äæ”ęÆļ¼š
Ā  id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
count 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 569.00 0.00
mean 30371831.43 14.13 19.29 91.97 654.89 0.10 0.10 0.09 0.05 0.18 0.06 0.41 1.22 2.87 40.34 0.01 0.03 0.03 0.01 0.02 0.00 16.27 25.68 107.26 880.58 0.13 0.25 0.27 0.11 0.29 0.08 nan
std 125020585.61 3.52 4.30 24.30 351.91 0.01 0.05 0.08 0.04 0.03 0.01 0.28 0.55 2.02 45.49 0.00 0.02 0.03 0.01 0.01 0.00 4.83 6.15 33.60 569.36 0.02 0.16 0.21 0.07 0.06 0.02 nan
min 8670.00 6.98 9.71 43.79 143.50 0.05 0.02 0.00 0.00 0.11 0.05 0.11 0.36 0.76 6.80 0.00 0.00 0.00 0.00 0.01 0.00 7.93 12.02 50.41 185.20 0.07 0.03 0.00 0.00 0.16 0.06 nan
25% 869218.00 11.70 16.17 75.17 420.30 0.09 0.06 0.03 0.02 0.16 0.06 0.23 0.83 1.61 17.85 0.01 0.01 0.02 0.01 0.02 0.00 13.01 21.08 84.11 515.30 0.12 0.15 0.11 0.06 0.25 0.07 nan
50% 906024.00 13.37 18.84 86.24 551.10 0.10 0.09 0.06 0.03 0.18 0.06 0.32 1.11 2.29 24.53 0.01 0.02 0.03 0.01 0.02 0.00 14.97 25.41 97.66 686.50 0.13 0.21 0.23 0.10 0.28 0.08 nan
75% 8813129.00 15.78 21.80 104.10 782.70 0.11 0.13 0.13 0.07 0.20 0.07 0.48 1.47 3.36 45.19 0.01 0.03 0.04 0.01 0.02 0.00 18.79 29.72 125.40 1084.00 0.15 0.34 0.38 0.16 0.32 0.09 nan
max 911320502.00 28.11 39.28 188.50 2501.00 0.16 0.35 0.43 0.20 0.30 0.10 2.87 4.88 21.98 542.20 0.03 0.14 0.40 0.05 0.08 0.03 36.04 49.54 251.20 4254.00 0.22 1.06 1.25 0.29 0.66 0.21 nan

2.2 ę•°ę®é¢„å¤„ē†Ā¶

ę•°ę®ęø…ę“—ć€ę ‡ē­¾ē¼–ē ć€å¤„ē†ē±»åˆ«äøå¹³č””ć€ē‰¹å¾ę ‡å‡†åŒ–

InĀ [98]:
data = pd.read_csv('data.csv')
data = data.drop(['id', 'Unnamed: 32'], axis=1, errors='ignore')  # åˆ é™¤ę— ē”Øåˆ—
data = data.dropna()  # åˆ é™¤å«ē¼ŗå¤±å€¼ēš„č”Œ
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})  # ę ‡ē­¾ę•°å€¼åŒ–
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
print("ę•°ę®ęø…ę“—å®Œęˆļ¼Œę ·ęœ¬ę•°:", data.shape[0], "特征数:", X.shape[1])

# åˆ†å‰²č®­ē»ƒé›†å’Œęµ‹čÆ•é›†ļ¼ˆåœØ SMOTE å’Œę ‡å‡†åŒ–ä¹‹å‰ļ¼‰
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# åÆ¹č®­ē»ƒé›†åŗ”ē”Ø SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# ę ‡å‡†åŒ–ē‰¹å¾
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_smote = scaler.fit_transform(X_train_smote)
X_test = scaler.transform(X_test)

print("ę•°ę®é¢„å¤„ē†å®Œęˆļ¼Œč®­ē»ƒé›†ę ·ęœ¬ę•°:", X_train_smote.shape[0], "ęµ‹čÆ•é›†ę ·ęœ¬ę•°:", X_test.shape[0])
ę•°ę®ęø…ę“—å®Œęˆļ¼Œę ·ęœ¬ę•°: 569 特征数: 30
ę•°ę®é¢„å¤„ē†å®Œęˆļ¼Œč®­ē»ƒé›†ę ·ęœ¬ę•°: 570 ęµ‹čÆ•é›†ę ·ęœ¬ę•°: 114

2.3 ę•°ę®åÆč§†åŒ–Ā¶

åˆ›å»ŗäŗ¤äŗ’å¼é„¼å›¾å±•ē¤ŗč‰Æę€§/ę¶ę€§ē—…ä¾‹åˆ†åøƒęÆ”ä¾‹

InĀ [99]:
import plotly.graph_objects as go
from IPython.display import HTML
import pandas as pd

# ē¤ŗä¾‹ę•°ę®
y = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])  # 假设 0 č”Øē¤ŗč‰Æę€§ļ¼Œ1 蔨示恶性

labels = ['良性', 'ꁶꀧ']
values = y.value_counts().sort_index()  

fig = go.Figure(data=[go.Pie(
    labels=labels,
    values=values,
    textinfo='label+percent+value', 
    textposition='inside',           
    insidetextorientation='radial',  
    marker=dict(colors=['#40E0D0', '#FFD700'], line=dict(color='#FFFFFF', width=2)), 
    hoverinfo='label+percent+value', 
    hole=0.2,                        
)])

fig.update_layout(
    title_text='čÆŠę–­ē±»åž‹åˆ†åøƒļ¼ˆč‰Æę€§ vs 恶性)-黄云翔',
    title_font_size=20,
    title_x=0.5,  
    legend_title_text='čÆŠę–­ē±»åž‹',
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5),  
    showlegend=True,
    paper_bgcolor='white', 
    plot_bgcolor='white',  
    font=dict(color='black')
)

html_content = fig.to_html(full_html=False, include_plotlyjs='cdn')
display(HTML(f'<div style="width:800px; margin:0 auto; background-color: white">{html_content}</div>'))

åˆ†ē»„ļ¼ˆ5个特征/ē»„ļ¼‰åˆ›å»ŗę ‡å‡†åŒ–åŽēš„ē‰¹å¾åˆ†åøƒē®±ēŗæå›¾ļ¼Œå±•ē¤ŗę•°ę®åˆ†åøƒå’Œå¼‚åøøå€¼

InĀ [100]:
from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
import pandas as pd
from IPython.display import HTML

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
colors = ['indianred', 'mediumseagreen', 'dodgerblue', 'plum', 'darkkhaki',
          'lightsalmon', 'gold', 'mediumturquoise', 'darkorange', 'lightgreen']
all_colors = colors * (len(X.columns) // len(colors) + 1) 

# åˆ›å»ŗäø€äøŖåˆ—č”Øę„å­˜å‚Øę‰€ęœ‰ē”Ÿęˆēš„ HTML 内容
html_contents = []

for group_idx in range(0, len(X_scaled_df.columns), 5):  
    group_cols = X_scaled_df.columns[group_idx:group_idx + 5]
    fig = go.Figure()
    for i, (col, color) in enumerate(zip(group_cols, all_colors[:len(group_cols)])):
        fig.add_trace(go.Box(
            y=X_scaled_df[col],
            name=col,
            boxpoints='outliers',
            jitter=0.5,  
            pointpos=0,
            whiskerwidth=0.2,  
            fillcolor=color,
            marker=dict(size=3, color=color, line=dict(width=1, color='black')),  
            line=dict(width=2, color='black'),  
            opacity=0.8,
            hovertemplate=f'特征: {col}<br>ę ‡å‡†åŒ–å€¼: %{{y}}<extra></extra>'
        ))
    fig.update_layout(
        title=f'ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾ - 组 {group_idx // 5 + 1}',
        title_font_size=16,  
        yaxis_title='ę ‡å‡†åŒ–å€¼ (0-1)',
        xaxis=dict(tickangle=45, tickfont=dict(size=10, color='black')),  
        showlegend=True,
        height=500,
        width=1200,  
        margin=dict(l=40, r=40, t=50, b=80), 
        plot_bgcolor='white',
        paper_bgcolor='white',
        font=dict(color='black'),
        boxmode='group',
        boxgroupgap=0.05, 
        boxgap=0.2      
    )
    
    # å°†å›¾č”Øč½¬ę¢äøŗ HTML å¹¶å­˜å‚Ø
    html_content = fig.to_html(full_html=False, include_plotlyjs='cdn')
    html_contents.append(html_content)
    print(f"å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ {group_idx // 5 + 1} ē»„ļ¼ˆ{', '.join(group_cols)})。")

# ę˜¾ē¤ŗę‰€ęœ‰äŗ¤äŗ’å¼å›¾č”Ø
for i, html_content in enumerate(html_contents):
    display(HTML(f'<h3>ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 {i+1}</h3>'))
    display(HTML(f'<div style="width:1200px; margin:0 auto">{html_content}</div>'))
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 1 ē»„ļ¼ˆradius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean)。
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 2 ē»„ļ¼ˆcompactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean)。
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 3 ē»„ļ¼ˆradius_se, texture_se, perimeter_se, area_se, smoothness_se)。
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 4 ē»„ļ¼ˆcompactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se)。
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 5 ē»„ļ¼ˆradius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst)。
å·²ē”Ÿęˆļ¼šä¹³č…ŗē™Œē‰¹å¾åˆ†åøƒē®±åž‹å›¾ļ¼Œē¬¬ 6 ē»„ļ¼ˆcompactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst)。

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 1

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 2

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 3

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 4

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 5

ä¹³č…ŗē™Œē‰¹å¾ę ‡å‡†åŒ–åˆ†åøƒå›¾-黄云翔 - 组 6

é€‰ę‹©äøŽčÆŠę–­ē›øå…³ę€§ęœ€é«˜ēš„4个特征

åˆ›å»ŗę•£ē‚¹ēŸ©é˜µå±•ē¤ŗē‰¹å¾é—“å…³ē³»

åˆ›å»ŗē›“ę–¹å›¾å±•ē¤ŗē‰¹å¾č¾¹ē¼˜åˆ†åøƒ

InĀ [101]:
import plotly.express as px
import numpy as np
import seaborn as sns
from IPython.display import HTML

# 1. ę•£ē‚¹ēŸ©é˜µå›¾
correlation = data.corr()['diagnosis'].abs().sort_values(ascending=False)[1:5] 
key_features = correlation.index.tolist()

fig_scatter = px.scatter_matrix(
    data,
    dimensions=key_features,
    color='diagnosis',
    color_continuous_scale=['#FF9999', '#99FF99'],
    title='å…³é”®ē‰¹å¾č”åˆåˆ†åøƒäøŽč¾¹ē¼˜åˆ†åøƒ-黄云翔',
    labels={col: col.replace('_', ' ').title() for col in key_features},
    height=800,
    width=1000
)

fig_scatter.update_traces(
    diagonal_visible=True, 
    showupperhalf=False, 
    marker=dict(size=6, opacity=0.6)
)

fig_scatter.update_layout(
    title_font_size=20,
    title_x=0.5,
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(size=10, color='black'),
    legend_title_text='čÆŠę–­ē»“ęžœ',
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5)
)

# 2. 盓方图
fig_hist = px.histogram(
    data,
    x=key_features,
    color='diagnosis',
    marginal="box",
    opacity=0.7,
    barmode='overlay',
    height=400,
    width=1000,
    title='å…³é”®ē‰¹å¾č¾¹ē¼˜åˆ†åøƒäøŽē®±åž‹å›¾-黄云翔'
)

fig_hist.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(size=10, color='black'),
    legend_title_text='čÆŠę–­ē»“ęžœ',
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="center", x=0.5)
)

scatter_html = fig_scatter.to_html(full_html=False, include_plotlyjs='cdn')
hist_html = fig_hist.to_html(full_html=False, include_plotlyjs='cdn')

display(HTML("""
<style>
.plot-container {
    margin: 20px auto;
    border: 1px solid #eee;
    box-shadow: 0 0 10px rgba(0,0,0,0.1);
    padding: 15px;
}
</style>
"""))

display(HTML('<h2 style="text-align:center">äŗ¤äŗ’å¼åÆč§†åŒ–åˆ†ęž</h2>'))
display(HTML('<div class="plot-container">' + scatter_html + '</div>'))
display(HTML('<div class="plot-container">' + hist_html + '</div>'))

fig_scatter.write_html("scatter_matrix.html", full_html=True)
fig_hist.write_html("histogram_boxplot.html", full_html=True)

äŗ¤äŗ’å¼åÆč§†åŒ–åˆ†ęž

åˆ›å»ŗę‰€ęœ‰ē‰¹å¾ēš„ē›øå…³ę€§ēƒ­åŠ›å›¾ļ¼ŒčÆ†åˆ«é«˜åŗ¦ē›øå…³ē‰¹å¾

InĀ [102]:
corr_matrix = X.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('ē‰¹å¾ē›øå…³ę€§ēƒ­å›¾-黄云翔')
# plt.show()
fig.write_html("chart.html", include_plotlyjs='cdn')
No description has been provided for this image

åˆ›å»ŗå°ęē“å›¾å±•ē¤ŗå…³é”®ē‰¹å¾åœØč‰Æ/ę¶ę€§čÆŠę–­äø­ēš„åˆ†åøƒå·®å¼‚

InĀ [103]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# 关键特征选择
key_features = ['radius_mean', 'texture_mean', 'area_mean', 'concavity_mean']

feature_names = {
    'radius_mean': '平均半径',
    'texture_mean': 'å¹³å‡ēŗ¹ē†',
    'area_mean': 'å¹³å‡é¢ē§Æ',
    'concavity_mean': 'å¹³å‡å‡¹é™·åŗ¦',
    'diagnosis': 'čÆŠę–­ē»“ęžœ'
}

scaler = StandardScaler()
data_scaled = data.copy()
data_scaled[key_features] = scaler.fit_transform(data[key_features])
data_melted = pd.melt(data_scaled, id_vars='diagnosis', value_vars=key_features, var_name='variable', value_name='value')
plt.figure(figsize=(12, 6))
sns.violinplot(x='variable', y='value', hue='diagnosis', data=data_melted, split=True, inner='quart')
plt.title('ę ‡å‡†åŒ–åŽēš„å…³é”®ē‰¹å¾å°ęē“å›¾-黄云翔')
plt.xticks(rotation=45)
current_labels = [label.get_text() for label in plt.gca().get_xticklabels()]
plt.xticks(range(len(current_labels)), [feature_names[label] for label in current_labels])
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles, ['良性', 'ꁶꀧ'], title='čÆŠę–­ē»“ęžœ')

plt.xlabel('特征')
plt.ylabel('ę ‡å‡†åŒ–å€¼')
plt.rcParams['font.sans-serif'] = ['SimHei']  
plt.rcParams['axes.unicode_minus'] = False  


plt.tight_layout()  
plt.savefig('violin_plot_key_features_chinese.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

使用 StandardScaler åÆ¹ē‰¹å¾ę•°ę® X čæ›č”Œę ‡å‡†åŒ–å¤„ē†ä»„ę¶ˆé™¤äøåŒē‰¹å¾é‡ēŗ²ēš„å½±å“ļ¼Œå¹¶å°†ę•°ę®é›†ęŒ‰ 80% 和 20% ēš„ęÆ”ä¾‹åˆ’åˆ†äøŗč®­ē»ƒé›†å’Œęµ‹čÆ•é›†ć€‚åœØ RFE ē‰¹å¾é€‰ę‹©é˜¶ę®µļ¼Œä»„é€»č¾‘å›žå½’ęØ”åž‹ä½œäøŗåŸŗē”€åˆ†ē±»å™Øļ¼Œåˆ›å»ŗ RFE åÆ¹č±”é€’å½’ę¶ˆé™¤ęœ€äøé‡č¦ēš„ē‰¹å¾ļ¼Œč®­ē»ƒåŽé€ščæ‡ ranking_ å±žę€§čŽ·å–ē‰¹å¾ęŽ’åļ¼ŒęŽ’å 1 äøŗęœ€é‡č¦ē‰¹å¾ć€‚ē‰¹å¾ę•°é‡ę€§čƒ½čÆ„ä¼°ēŽÆčŠ‚ļ¼ŒéåŽ†ä»Ž 1 åˆ°å…ØéƒØē‰¹å¾ēš„ę‰€ęœ‰åÆčƒ½ę•°é‡ļ¼ŒåÆ¹ęÆäøŖē‰¹å¾ę•°é‡ kļ¼Œé€‰å–ęŽ’åå‰ k ēš„ē‰¹å¾ļ¼Œåˆ©ē”Ø 5 ęŠ˜äŗ¤å‰éŖŒčÆčÆ„ä¼°č®­ē»ƒé›†ę€§čƒ½ļ¼ŒåŒę—¶åœØęµ‹čÆ•é›†äøŠčÆ„ä¼°ęØ”åž‹å‡†ē”®ēŽ‡å¹¶č®°å½•äŗ¤å‰éŖŒčÆäøŽęµ‹čÆ•å¾—åˆ†ć€‚ē»“ęžœåˆ†ęžę—¶ļ¼Œē”®å®šäŗ¤å‰éŖŒčÆå’Œęµ‹čÆ•é›†äøŠę€§čƒ½ęœ€ä½³ēš„ē‰¹å¾ę•°é‡ļ¼Œęå–åÆ¹åŗ”ęœ€ä½³ē‰¹å¾å­é›†ēš„ęØ”åž‹ē³»ę•°ä½œäøŗē‰¹å¾é‡č¦ę€§ć€‚

InĀ [104]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
from IPython.display import HTML


scaler = StandardScaler()
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# 初始化RFEå’ŒęØ”åž‹
model = LogisticRegression(random_state=42, max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=1, step=1)
rfe.fit(X_train, y_train)

# čŽ·å–ē‰¹å¾ęŽ’åå’Œę”ÆęŒåŗ¦
feature_ranking = rfe.ranking_
feature_support = rfe.support_

# čÆ„ä¼°äøåŒē‰¹å¾ę•°é‡äø‹ēš„ę€§čƒ½
n_features = X_train.shape[1]
n_features_range = range(1, n_features + 1)
cv_scores = []
test_scores = []

for k in n_features_range:
    selected_features = feature_ranking <= k
    X_train_selected = X_train[:, selected_features]
    X_test_selected = X_test[:, selected_features]
    
    model = LogisticRegression(random_state=42, max_iter=1000)
    cv_score = cross_val_score(model, X_train_selected, y_train, cv=5, scoring='accuracy').mean()
    model.fit(X_train_selected, y_train)
    test_score = accuracy_score(y_test, model.predict(X_test_selected))
    
    cv_scores.append(cv_score)
    test_scores.append(test_score)

# ę‰¾å‡ŗęœ€ä½³ē‰¹å¾ę•°é‡
best_cv_score_idx = np.argmax(cv_scores)
best_test_score_idx = np.argmax(test_scores)
best_cv_k = n_features_range[best_cv_score_idx]
best_test_k = n_features_range[best_test_score_idx]

# čŽ·å–ęœ€ä½³ē‰¹å¾é›†ēš„ē³»ę•°
best_features = feature_ranking <= best_cv_k
model.fit(X_train[:, best_features], y_train)
feature_importances = np.abs(model.coef_[0])
sorted_idx = np.argsort(feature_importances)[::-1][:15]

# åˆ›å»ŗåÆč§†åŒ–å›¾č”Ø
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=(
        f'ē‰¹å¾ęŽ’å-黄云翔 (1=ęœ€é‡č¦)',
        f'ęœ€ä½³ē‰¹å¾å­é›†ęƒé‡-黄云翔 (k={best_cv_k})',
        'ē‰¹å¾ę•°é‡åÆ¹ęØ”åž‹ę€§čƒ½ēš„å½±å“-黄云翔'
    ),
    vertical_spacing=0.15,
    horizontal_spacing=0.05
)

# 图1: ē‰¹å¾ęŽ’å
top_features = np.argsort(feature_ranking)[:15]
fig.add_trace(
    go.Bar(
        x=[f'特征 {i}' for i in top_features],
        y=feature_ranking[top_features],
        marker_color='royalblue',
        name='ē‰¹å¾ęŽ’å',
        hovertemplate='特征: %{x}<br>ęŽ’å: %{y}<extra></extra>'
    ),
    row=1, col=1
)

# 图2: ē‰¹å¾é‡č¦ę€§
fig.add_trace(
    go.Bar(
        x=[f'特征 {i}' for i in sorted_idx],
        y=feature_importances[sorted_idx],
        marker_color='mediumseagreen',
        name='ē‰¹å¾ęƒé‡',
        hovertemplate='特征: %{x}<br>ꝃ重: %{y:.4f}<extra></extra>'
    ),
    row=2, col=1
)

# 图3: ęØ”åž‹ę€§čƒ½
fig.add_trace(
    go.Scatter(
        x=list(n_features_range),
        y=cv_scores,
        mode='lines+markers',
        line=dict(color='mediumorchid', width=2),
        marker=dict(size=8),
        name='äŗ¤å‰éŖŒčÆå¾—åˆ†',
        hovertemplate='ē‰¹å¾ę•°é‡: %{x}<br>CV得分: %{y:.3f}<extra></extra>'
    ),
    row=3, col=1
)
fig.add_trace(
    go.Scatter(
        x=list(n_features_range),
        y=test_scores,
        mode='lines+markers',
        line=dict(color='deepskyblue', width=2),
        marker=dict(size=8),
        name='ęµ‹čÆ•å¾—åˆ†',
        hovertemplate='ē‰¹å¾ę•°é‡: %{x}<br>ęµ‹čÆ•å¾—åˆ†: %{y:.3f}<extra></extra>'
    ),
    row=3, col=1
)

# ę·»åŠ ęœ€ä½³ē‰¹å¾ę ‡č®°
fig.add_vline(
    x=best_cv_k, 
    line=dict(dash='dash', color='red', width=1.5),
    annotation_text=f'ęœ€ä½³CV: k={best_cv_k}',
    annotation_position='top right',
    row=3, col=1
)
fig.add_vline(
    x=best_test_k, 
    line=dict(dash='dash', color='green', width=1.5),
    annotation_text=f'ęœ€ä½³ęµ‹čÆ•: k={best_test_k}',
    annotation_position='bottom right',
    row=3, col=1
)

# ę›“ę–°ę•“ä½“åøƒå±€
fig.update_layout(
    height=1200,
    width=1000,
    title_text='RFEē‰¹å¾é€‰ę‹©åˆ†ęž',
    title_font=dict(size=24, family="Arial", color='black'),
    title_x=0.5,
    showlegend=True,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="center",
        x=0.5
    ),
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(size=12, color='black'),
    margin=dict(l=80, r=80, t=100, b=80)
)

# ę›“ę–°å­å›¾åøƒå±€
fig.update_xaxes(title_text='特征瓢引', row=1, col=1, title_font=dict(size=14))
fig.update_yaxes(title_text='ęŽ’å (1=ęœ€é‡č¦)', row=1, col=1, title_font=dict(size=14))
fig.update_xaxes(title_text='特征瓢引', row=2, col=1, title_font=dict(size=14))
fig.update_yaxes(title_text='ęƒé‡ē»åÆ¹å€¼', row=2, col=1, title_font=dict(size=14))
fig.update_xaxes(title_text='ē‰¹å¾ę•°é‡', row=3, col=1, title_font=dict(size=14))
fig.update_yaxes(title_text='å‡†ē”®ēŽ‡', row=3, col=1, title_font=dict(size=14))

# ē”Ÿęˆäŗ¤äŗ’å¼HTML内容
html_content = fig.to_html(
    full_html=False,
    include_plotlyjs='cdn',
    config={
        'responsive': True,
        'displayModeBar': True,
        'scrollZoom': True
    }
)

# åˆ›å»ŗå±…äø­ę˜¾ē¤ŗēš„HTML容器
centered_html = f"""
<div style="
    display: flex;
    justify-content: center;
    align-items: center;
    flex-direction: column;
    width: 100%;
    padding: 20px;
    background-color: #f9f9f9;
    border-radius: 10px;
    box-shadow: 0 4px 8px rgba(0,0,0,0.1);
">
    <h2 style="text-align: center; color: #333; margin-bottom: 20px;">
        RFEē‰¹å¾é€‰ę‹©äŗ¤äŗ’å¼åˆ†ęž
    </h2>
    <div style="width: 1000px; height: 1200px;">
        {html_content}
    </div>
</div>
"""

display(HTML(centered_html))

RFEē‰¹å¾é€‰ę‹©äŗ¤äŗ’å¼åˆ†ęž

2.4 特征巄程¶

1. å¼‚åøøå€¼ē»¼åˆę£€ęµ‹äøŽå¤„ē†

  • å¤šę–¹ę³•ę£€ęµ‹ļ¼šē»“åˆZ-score(>3σ)态IQR(1.5å€č·)和IsolationForest(10%ę±”ęŸ“ēŽ‡)čÆ†åˆ«å¼‚åøøå€¼

  • ę•ˆęžœčÆ„ä¼°ļ¼šåÆ¹ęÆ”ē§»é™¤å‰åŽę•°ę®åˆ†åøƒå˜åŒ–ļ¼ˆå‡å€¼/ę ‡å‡†å·®ļ¼‰åŠé€»č¾‘å›žå½’ęØ”åž‹å‡†ē”®ēŽ‡ęå‡

  • åÆč§†åŒ–ļ¼šåˆ†ē»„ē®±ēŗæå›¾å±•ē¤ŗåŽŸå§‹ę•°ę®å¼‚åøøē‚¹åˆ†åøƒ

2. ē‰¹å¾ē­›é€‰ä¼˜åŒ–

  • ä½Žę–¹å·®čæ‡ę»¤ļ¼šē§»é™¤ę–¹å·®<0.01ēš„ē‰¹å¾ļ¼ˆå¦‚ēØ³å®šäøå˜ēš„ę— ę•ˆęŒ‡ę ‡ļ¼‰

  • é«˜ē›øå…³å‰”é™¤ļ¼šåˆ é™¤ē›øå…³ē³»ę•°>0.9ēš„å†—ä½™ē‰¹å¾ļ¼ˆéæå…å¤šé‡å…±ēŗæę€§ļ¼‰

  • RFEē‰¹å¾é€‰ę‹©ļ¼šļ¼Œé€ščæ‡é€’å½’ē‰¹å¾ę¶ˆé™¤ē”®å®šē‰¹å¾é‡č¦ę€§ęŽ’åļ¼ŒåŠØę€čÆ„ä¼°äøåŒē‰¹å¾ę•°é‡äø‹ēš„ęØ”åž‹ę€§čƒ½ļ¼ˆ5ęŠ˜äŗ¤å‰éŖŒčÆļ¼‰ļ¼Œč‡ŖåŠØé€‰ę‹©ęœ€ä¼˜ē‰¹å¾å­é›†ļ¼ˆęœ€ä½³k=äŗ¤å‰éŖŒčÆå³°å€¼åÆ¹åŗ”ē‰¹å¾ę•°ļ¼‰

3. å¤šē»“åÆč§†åŒ–åˆ†ęž

  • ē‰¹å¾ęŽ’åå›¾ļ¼šęŸ±ēŠ¶å›¾å±•ē¤ŗRFEčÆ„å®šēš„TOP15é‡č¦ē‰¹å¾

  • ęƒé‡åˆ†ęžå›¾ļ¼šé€»č¾‘å›žå½’ē³»ę•°ē»åÆ¹å€¼åę˜ ē‰¹å¾å½±å“åŠ›

  • ę€§čƒ½å…³ē³»å›¾ļ¼šåŒę›²ēŗæåÆ¹ęÆ”ē‰¹å¾ę•°é‡äøŽéŖŒčÆé›†/ęµ‹čÆ•é›†å‡†ē”®ēŽ‡å…³ē³»

  • t-SNEé™ē»“ļ¼šē”Øęœ€ä½³ē‰¹å¾å­é›†å®žēŽ°é«˜ē»“ę•°ę®äŗŒē»“ęŠ•å½±ļ¼Œé¢œč‰²ē¼–ē čÆŠę–­ē»“ęžœ

InĀ [105]:
import plotly.io as pio
from IPython.display import HTML, display
import plotly.graph_objects as go  

pio.renderers.default = "plotly_mimetype+notebook" 



def create_interactive_boxplots(X):  
    base_colors = ['indianred', 'mediumseagreen', 'dodgerblue', 'plum', 'darkkhaki',
                   'lightsalmon', 'gold', 'mediumturquoise', 'darkorange', 'lightgreen']
    all_colors = base_colors * (len(X.columns) // len(base_colors) + 1)
    
    html_contents = []
    
    for group_idx in range(0, len(X.columns), 5):
        group_cols = X.columns[group_idx:group_idx + 5]
        fig = go.Figure()
        
        for col, color in zip(group_cols, all_colors[:len(group_cols)]):
            fig.add_trace(go.Box(
                y=X[col],
                name=col,
                boxpoints='outliers',
                jitter=0.5,
                pointpos=0,
                whiskerwidth=0.2,
                fillcolor=color,
                marker=dict(size=3, color=color, line=dict(width=1, color='black')),
                line=dict(width=2, color='black'),
                opacity=0.8,
                hovertemplate=f'特征: {col}<br>值: %{{y}}<extra></extra>'
            ))
        
        fig.update_layout(
            title=f'å¼‚åøøå€¼å¤„ē†å‰ē‰¹å¾åˆ†åøƒ-黄云翔 - 组 {group_idx//5 + 1}',
            height=500,
            width=1000,
            plot_bgcolor='white',
            paper_bgcolor='white',
        
            title_font_color='black',  
            font=dict(
                family="Arial",
                size=12,
                color="black" 
            ),
         
            xaxis=dict(
                tickfont=dict(color='black'),
                title_font=dict(color='black')
            ),
            yaxis=dict(
                tickfont=dict(color='black'),
                title_font=dict(color='black')
            )
        )
        
        # č½¬ę¢äøŗHTMLå¹¶å­˜å‚Ø
        html_contents.append(fig.to_html(
            full_html=False,
            include_plotlyjs='cdn',
            config={'responsive': True}
        ))
    
    return html_contents

def display_in_notebook(html_contents):
    display(HTML("""
    <style>
        .plot-container {
            margin: 20px auto;
            width: 1000px;
            box-shadow: 0 0 10px rgba(0,0,0,0.1);
            padding: 15px;
        }
        .plot-title {
            text-align: center;
            font-size: 18px;
            margin: 10px 0;
            color: black;  # ę·»åŠ é¢œč‰²č®¾ē½®
        }
    </style>
    """))
    
    for i, html in enumerate(html_contents):
        display(HTML(f"""
        <div class="plot-title">ē‰¹å¾åˆ†åøƒå›¾ - 第 {i+1} 组</div>
        <div class="plot-container">{html}</div>
        """))

def save_full_html(html_contents, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"""
        <!DOCTYPE html>
        <html>
        <head>
            <meta charset="UTF-8">
            <title>ä¹³č…ŗē™Œē‰¹å¾åˆ†ęž</title>
            <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
            <style>
                body {{
                    font-family: Arial;
                    margin: 20px;
                    color: black;  # ę·»åŠ é¢œč‰²č®¾ē½®
                }}
                .plot-container {{
                    margin: 0 auto;
                    width: 1000px;
                }}
                h2 {{
                    color: #2c3e50;
                    text-align: center;
                }}
            </style>
        </head>
        <body>
            <h2>ä¹³č…ŗē™Œē‰¹å¾åˆ†ęžęŠ„å‘Š</h2>
            {"".join([f'<div class="plot-container">{html}</div>' for html in html_contents])}
        </body>
        </html>
        """)


if __name__ == "__main__":
    
    boxplot_htmls = create_interactive_boxplots(X)
    
    display_in_notebook(boxplot_htmls)
    
    # äæå­˜äøŗå®Œę•“HTMLꖇ件
    save_full_html(boxplot_htmls, "breast_cancer_analysis.html")
    print("å·²äæå­˜äøŗ breast_cancer_analysis.html")
ē‰¹å¾åˆ†åøƒå›¾ - 第 1 组
ē‰¹å¾åˆ†åøƒå›¾ - 第 2 组
ē‰¹å¾åˆ†åøƒå›¾ - 第 3 组
ē‰¹å¾åˆ†åøƒå›¾ - 第 4 组
ē‰¹å¾åˆ†åøƒå›¾ - 第 5 组
ē‰¹å¾åˆ†åøƒå›¾ - 第 6 组
å·²äæå­˜äøŗ breast_cancer_analysis.html

2.5 ęØ”åž‹ęž„å»ŗäøŽčÆ„ä¼°Ā¶

åŸŗēŗæęØ”åž‹ļ¼šé€»č¾‘å›žå½’+SMOTE

InĀ [106]:
from sklearn.metrics import confusion_matrix, classification_report

log_reg = LogisticRegression(max_iter=5000, random_state=42)  
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
class_report_log_reg = classification_report(y_test, y_pred_log_reg, output_dict=True)
print(f"é€»č¾‘å›žå½’ęØ”åž‹å‡†ē”®ēŽ‡: {accuracy_log_reg:.4f}")
print("ę··ę·†ēŸ©é˜µ:")
print(conf_matrix_log_reg)
print("åˆ†ē±»ęŠ„å‘Š:")
print(classification_report(y_test, y_pred_log_reg))
é€»č¾‘å›žå½’ęØ”åž‹å‡†ē”®ēŽ‡: 0.9737
ę··ę·†ēŸ©é˜µ:
[[70  1]
 [ 2 41]]
åˆ†ē±»ęŠ„å‘Š:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

ä¼˜åŒ–ęØ”åž‹ļ¼šXGBoost + RandomizedSearchCV + EarlyStopping + Threshold Optimization

InĀ [107]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, roc_curve, accuracy_score, confusion_matrix, classification_report
from xgboost import XGBClassifier
import numpy as np
from scipy.stats import uniform, randint

X_train = np.nan_to_num(X_train, nan=0.0)
y_train = np.nan_to_num(y_train, nan=0.0)
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])

xgb = XGBClassifier(
    eval_metric='logloss',
    random_state=42,
    scale_pos_weight=scale_pos_weight,
    tree_method='hist'
)

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(2, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'gamma': uniform(0, 0.5),
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(0, 2),
    'min_child_weight': randint(1, 10)
}

random_search = RandomizedSearchCV(
    xgb, 
    param_dist, 
    n_iter=100,
    cv=5, 
    scoring='f1_weighted',
    n_jobs=1, 
    verbose=1,
    random_state=42,
    error_score='raise'  
)
random_search.fit(X_train, y_train)

print(f"ęœ€ä½³å‚ę•°: {random_search.best_params_}")

X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

best_params = random_search.best_params_.copy()
original_n_estimators = best_params.pop('n_estimators', None)

xgb_early = XGBClassifier(
    **best_params,  
    eval_metric='logloss',
    random_state=42,
    n_estimators=1000,  
    early_stopping_rounds=50,  
    tree_method='hist',  
    device='cpu'        
)

xgb_early.fit(
    X_train_sub, 
    y_train_sub,
    eval_set=[(X_val, y_val)],  
    verbose=False
)

best_iter = xgb_early.best_iteration

final_xgb = XGBClassifier(
    **best_params,  
    n_estimators=best_iter,  
    eval_metric='logloss',
    random_state=42,
    tree_method='hist',  
    device='cpu'        
)
final_xgb.fit(X_train, y_train)

best_xgb = final_xgb

y_proba = best_xgb.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
gmeans = np.sqrt(tpr * (1 - fpr))  # å‡ ä½•å¹³å‡ę•°
ix = np.argmax(gmeans)
best_thresh = thresholds[ix]

# 添加F_betaåˆ†ę•°ä¼˜åŒ–ļ¼ˆę›“ę³Øé‡å¬å›žēŽ‡ļ¼‰
beta = 1.2  # å¬å›žēŽ‡ęƒé‡>ē²¾ē”®ēŽ‡
f_beta_scores = []
for thresh in thresholds:
    y_pred_temp = (y_proba >= thresh).astype(int)
    precision = precision_score(y_test, y_pred_temp, zero_division=0)
    recall = recall_score(y_test, y_pred_temp)
    if (precision + recall) > 0:
        f_beta = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
    else:
        f_beta = 0
    f_beta_scores.append(f_beta)
    
best_f_beta_ix = np.argmax(f_beta_scores)
best_f_beta_thresh = thresholds[best_f_beta_ix]

# é€‰ę‹©ęœ€ä½³é˜ˆå€¼ļ¼ˆä¼˜å…ˆF_betaä¼˜åŒ–ēš„é˜ˆå€¼ļ¼‰
y_pred_xgb = (y_proba >= best_f_beta_thresh).astype(int)

# čÆ„ä¼°ęŒ‡ę ‡ļ¼ˆäæęŒåŽŸå˜é‡åļ¼‰
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
class_report_xgb = classification_report(y_test, y_pred_xgb)

print(f"åŽŸå§‹ęœ€ä½³ę ‘ę•°é‡: {original_n_estimators}")
print(f"ę—©åœęœ€ä½³čæ­ä»£ę¬”ę•°: {best_iter}")
print(f"åŸŗäŗŽG-Meanēš„ęœ€ä½³é˜ˆå€¼: {best_thresh:.4f}")
print(f"åŸŗäŗŽF{beta}åˆ†ę•°ēš„é˜ˆå€¼: {best_f_beta_thresh:.4f}")
print(f"ä¼˜åŒ–åŽēš„XGBoostęØ”åž‹å‡†ē”®ēŽ‡: {accuracy_xgb:.4f}")
print("ę··ę·†ēŸ©é˜µ:")
print(conf_matrix_xgb)
print("åˆ†ē±»ęŠ„å‘Š:")
print(class_report_xgb)

print("ęœ€ä½³ęØ”åž‹å·²äæå­˜äøŗbest_xgb")
Fitting 5 folds for each of 100 candidates, totalling 500 fits
ęœ€ä½³å‚ę•°: {'colsample_bytree': np.float64(0.8346141738643036), 'gamma': np.float64(0.28222927126132064), 'learning_rate': np.float64(0.12363178778538679), 'max_depth': 6, 'min_child_weight': 6, 'n_estimators': 229, 'reg_alpha': np.float64(0.6459172413316012), 'reg_lambda': np.float64(1.1415566093378238), 'subsample': np.float64(0.7424386903591385)}
åŽŸå§‹ęœ€ä½³ę ‘ę•°é‡: 229
ę—©åœęœ€ä½³čæ­ä»£ę¬”ę•°: 91
åŸŗäŗŽG-Meanēš„ęœ€ä½³é˜ˆå€¼: 0.4970
åŸŗäŗŽF1.2åˆ†ę•°ēš„é˜ˆå€¼: 0.4970
ä¼˜åŒ–åŽēš„XGBoostęØ”åž‹å‡†ē”®ēŽ‡: 0.9825
ę··ę·†ēŸ©é˜µ:
[[71  0]
 [ 2 41]]
åˆ†ē±»ęŠ„å‘Š:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        71
           1       1.00      0.95      0.98        43

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

ęœ€ä½³ęØ”åž‹å·²äæå­˜äøŗbest_xgb

3. å…³é”®ē»“ęžœĀ¶

åÆ¹ęÆ”é€»č¾‘å›žå½’å’ŒXGBoostēš„ę€§čƒ½ęŒ‡ę ‡

åˆ›å»ŗå¤šęŒ‡ę ‡ęŸ±ēŠ¶å›¾åÆ¹ęÆ”ęØ”åž‹ę€§čƒ½

绘制ROCę›²ēŗæęÆ”č¾ƒAUC值

绘制精甮度-å¬å›žēŽ‡ę›²ēŗæ

InĀ [109]:
# ęØ”åž‹åÆ¹ęÆ”éƒØåˆ†
print("ęØ”åž‹åÆ¹ęÆ”ļ¼š")
print(f"é€»č¾‘å›žå½’å‡†ē”®ēŽ‡: {accuracy_log_reg:.4f}")
print(f"ä¼˜åŒ–åŽēš„XGBoostå‡†ē”®ēŽ‡: {accuracy_xgb:.4f}")
print("\né€»č¾‘å›žå½’ę··ę·†ēŸ©é˜µ:")
print(conf_matrix_log_reg)
print("\nä¼˜åŒ–åŽēš„XGBoostę··ę·†ēŸ©é˜µ:")
print(conf_matrix_xgb)

# ē”®äæåˆ†ē±»ęŠ„å‘Šä½æē”Øē›øåŒēš„ę ¼å¼
class_report_log_reg = classification_report(
    y_test, y_pred_log_reg, 
    output_dict=True,
    target_names=['0', '1']
)

class_report_xgb = classification_report(
    y_test, y_pred_xgb, 
    output_dict=True,
    target_names=['0', '1']
)

print("\né€»č¾‘å›žå½’åˆ†ē±»ęŠ„å‘Š:")
print(classification_report(y_test, y_pred_log_reg))
print("\nä¼˜åŒ–åŽēš„XGBooståˆ†ē±»ęŠ„å‘Š:")
print(classification_report(y_test, y_pred_xgb))

# 讔算特异性 (Specificity) = TN / (TN + FP)
def calculate_specificity(conf_matrix):
    tn, fp, fn, tp = conf_matrix.ravel()
    return tn / (tn + fp) if (tn + fp) > 0 else 0

specificity_log_reg = calculate_specificity(conf_matrix_log_reg)
specificity_xgb = calculate_specificity(conf_matrix_xgb)

# é€»č¾‘å›žå½’ęŒ‡ę ‡ļ¼ˆē±»åˆ«1)
precision_log_reg = class_report_log_reg['1']['precision']
recall_log_reg = class_report_log_reg['1']['recall']
f1_log_reg = class_report_log_reg['1']['f1-score']

# XGBoostęŒ‡ę ‡ļ¼ˆē±»åˆ«1)
precision_xgb = class_report_xgb['1']['precision']
recall_xgb = class_report_xgb['1']['recall']
f1_xgb = class_report_xgb['1']['f1-score']

metrics = ['å‡†ē”®ēŽ‡', 'å¬å›žēŽ‡', '特异性', 'ē²¾ē”®ēŽ‡', 'F1值']
log_reg_scores = [accuracy_log_reg, recall_log_reg, specificity_log_reg, precision_log_reg, f1_log_reg]
xgb_scores = [accuracy_xgb, recall_xgb, specificity_xgb, precision_xgb, f1_xgb]

x = np.arange(len(metrics)) 
width = 0.35  

fig, ax = plt.subplots(figsize=(12, 7))
bars1 = ax.bar(x - width/2, log_reg_scores, width, label='é€»č¾‘å›žå½’', color='#1f77b4', edgecolor='black')
bars2 = ax.bar(x + width/2, xgb_scores, width, label='XGBoost', color='#ff7f0e', edgecolor='black')

ax.set_xlabel('čÆ„ä¼°ęŒ‡ę ‡', fontsize=12)
ax.set_ylabel('分值', fontsize=12)
ax.set_title('é€»č¾‘å›žå½’äøŽ XGBoost ęØ”åž‹ę€§čƒ½åÆ¹ęÆ”-黄云翔', fontsize=14, fontweight='bold', pad=20)
ax.set_xticks(x)
ax.set_xticklabels(metrics, fontsize=11)
ax.legend(fontsize=11)
ax.grid(True, axis='y', linestyle='--', alpha=0.7)
ax.set_ylim(0.85, 1.0)  # 优化Yč½“čŒƒå›“ä»„ēŖå‡ŗå·®å¼‚

# č°ƒę•“ē™¾åˆ†ęÆ”ę ‡ē­¾ä½ē½®ļ¼Œå¢žåŠ yč½“åē§»é‡
for bar in bars1:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval - 0.01, 
            f'{yval:.1%}',  
            ha='center', va='bottom', fontsize=9)

for bar in bars2:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval - 0.01, 
            f'{yval:.1%}', 
            ha='center', va='bottom', fontsize=9)

plt.figtext(0.5, 0.01, 
            f"é€»č¾‘å›žå½’å‡†ē”®ēŽ‡: {accuracy_log_reg:.1%} | XGBoostå‡†ē”®ēŽ‡: {accuracy_xgb:.1%} | ęå‡: {(accuracy_xgb - accuracy_log_reg):.1%}",
            ha="center", fontsize=11, bbox=dict(facecolor='lightgray', alpha=0.5))

plt.tight_layout(rect=[0, 0.05, 1, 0.95])  
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# äæ®å¤ROCę›²ēŗæéƒØåˆ† - ē”®äæä½æē”Øę­£ē”®ēš„ęµ‹čÆ•é›†
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

# ē”®äæęµ‹čÆ•é›†å¤„ē†äø€č‡“
X_test_processed = np.nan_to_num(X_test, nan=0.0)

# é€»č¾‘å›žå½’é¢„ęµ‹ę¦‚ēŽ‡
y_pred_prob_log_reg = log_reg.predict_proba(X_test_processed)[:, 1]

# XGBoosté¢„ęµ‹ę¦‚ēŽ‡
y_pred_prob_xgb = best_xgb.predict_proba(X_test_processed)[:, 1]

# 讔算AUC值
auc_log_reg = roc_auc_score(y_test, y_pred_prob_log_reg)
auc_xgb = roc_auc_score(y_test, y_pred_prob_xgb)

fpr_log_reg, tpr_log_reg, _ = roc_curve(y_test, y_pred_prob_log_reg)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_pred_prob_xgb)

plt.figure(figsize=(10, 6))
plt.plot(fpr_log_reg, tpr_log_reg, label=f'é€»č¾‘å›žå½’ (AUC = {auc_log_reg:.3f})', linewidth=2)
plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {auc_xgb:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlabel('å‡é˜³ę€§ēŽ‡', fontsize=12)
plt.ylabel('ēœŸé˜³ę€§ēŽ‡', fontsize=12)
plt.title('ROC 曲线-黄云翔', fontsize=14, fontweight='bold', pad=20)
plt.legend(fontsize=11)
plt.grid(True, linestyle='--', alpha=0.7)
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()

# 绘制精甮度-å¬å›žēŽ‡ę›²ēŗæ
precision_log_reg, recall_log_reg, _ = precision_recall_curve(y_test, y_pred_prob_log_reg)
precision_xgb, recall_xgb, _ = precision_recall_curve(y_test, y_pred_prob_xgb)

plt.figure(figsize=(10, 6))
plt.plot(recall_log_reg, precision_log_reg, label='é€»č¾‘å›žå½’', linewidth=2)
plt.plot(recall_xgb, precision_xgb, label='XGBoost', linewidth=2)
plt.xlabel('å¬å›žēŽ‡', fontsize=12)
plt.ylabel('精甮度', fontsize=12)
plt.title('精甮度-å¬å›žēŽ‡ę›²ēŗæ-黄云翔', fontsize=14, fontweight='bold', pad=20)
plt.legend(fontsize=11)
plt.grid(True, linestyle='--', alpha=0.7)
plt.savefig('pr_curve.png', dpi=300, bbox_inches='tight')
plt.show()
ęØ”åž‹åÆ¹ęÆ”ļ¼š
é€»č¾‘å›žå½’å‡†ē”®ēŽ‡: 0.9737
ä¼˜åŒ–åŽēš„XGBoostå‡†ē”®ēŽ‡: 0.9825

é€»č¾‘å›žå½’ę··ę·†ēŸ©é˜µ:
[[70  1]
 [ 2 41]]

ä¼˜åŒ–åŽēš„XGBoostę··ę·†ēŸ©é˜µ:
[[71  0]
 [ 2 41]]

é€»č¾‘å›žå½’åˆ†ē±»ęŠ„å‘Š:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114


ä¼˜åŒ–åŽēš„XGBooståˆ†ē±»ęŠ„å‘Š:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        71
           1       1.00      0.95      0.98        43

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
InĀ [108]: